RELIABILITY WERE FAIR, (4) WITHIN-OBSERVER RELIABILITY WERE
The Teacher Practices Observation Record (TPOR) is an instrument for
measuring classroom behavior by systematic observation. It attempts to
measure the agreement-disagreement of teachers' observed classroom behavior
with educational practices advocated by John Dewey in his philosophy of
experimentalism. In addition to presenting this instrument and briefly
describing its development, we will report the reliability data obtained
by using it in a study of observations of filmed teaching episodes. The
data reported on the TPOR will be placed in the context of the general
problem of studying reliability, and will be used to demonstrate a new
design for estimating the reliability of such observational measurements.

The TPOR was developed in conjunction with the Personal Beliefs
Inventory and the Teacher Practices Inventory, which attempt to measure
teacher beliefs with respect to Dewey's experimentalism.¹ The value of
these three instruments to educational research is not that they measure
agreement-disagreement with Dewey's philosophy but that they permit
comparable measurements of beliefs and practices in terms of a common
theoretical referent. It is this connection with companion measurements
of beliefs which differentiates the TPOR from most other instruments for
recording observations of classroom behavior. Likewise, its capability
for measuring and comparing observed teacher behavior with logically
congruent criteria for judging teacher competence gives the TPOR a key
function in our "Investigation of Observer-Judge Ratings of Teacher
Competence" at the University of Florida, a four-year research project
funded by the U.S. Office of Education.

¹ Bob Burton Brown, The Experimental Mind in Education (New York:
Harper and Row, in press with expected publication date September of
1967). See also Brown's "The Relationship of Experimentalism to Classroom
Practice." Unpublished Ph.D. thesis. Madison: The University of
Wisconsin, 1962.



TEACHER PRACTICES OBSERVATION RECORD 
The directions for the use of the Teacher Practices Observation 
Record are as follows: 

The Teacher Practices Observation Record provides a 
framework for observing and recording the classroom practices 
of the teacher. Your role as an observer is to watch and 
listen for signs of the sixty-two teacher practices listed 
and to record whether or not they were observed, WITHOUT 
MAKING JUDGMENTS AS TO THE RELATIVE IMPORTANCE OR RELEVANCE 
OF THOSE PRACTICES. 

There are three (3) separate 10-minute observation and
marking periods in each 30-minute visit to the
classroom. These are indicated by the column headings I, II,
and III. During period I, spend the first 5 minutes observing
the behavior of the teacher. In the last 5 minutes go down
the list and place a check (✓) mark in Column I beside all
practices you saw occur. Leave blank the space beside
practices which did not occur or which did not seem to
apply to this particular observation. Please consider
every practice listed; mark it or leave it blank. A particular
item is marked only once in a given column, no
matter how many times that practice occurs within the
10-minute observation period. A practice which occurs a dozen
times gets one check mark, the same as an item which occurs
only once.






















Repeat this process for the second 10-minute period,
marking in Column II. Repeat again for the third 10-minute
period, marking in Column III. Please add the total number
of check marks recorded for each teacher practice and record
in the column headed TOT. There may be from 0 to 3 total
check marks for each item.

The revised form of the Teacher Practices Observation Record is
presented below. It contains 62 items or "signs" of teacher practices.
With respect to Dewey's philosophy of experimentalism, 31 of these are
positive and 31 are negative. All even-numbered items are positive and
all odd-numbered items are negative, making it easy to score the results.

The Teacher Practices Observation Record is usually scored by first
totaling the number of check marks for each item, placing either a 0, 1,
2, or 3 in the column headed TOT. Next, the totals for all of the
odd-numbered items are reversed, changing 0 to 3, 1 to 2, 2 to 1, and 3 to 0.
Then by adding the totals for all items (both the totals for the untouched
even or "positive" items and for the adjusted odd or "negative" items) we
get a net score. A maximum score of 186 indicates complete experimentalism
and a minimum score of 0 indicates complete non-experimentalism. A score
of 94 or above indicates the observed teacher practices are more
experimental than non-experimental, and a score of 93 or below indicates the
opposite.
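The scoring rule just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the text; the function name `tpor_net_score` and the list layout (one TOT value per item, item 1 first) are assumptions made for the example, not part of the instrument.

```python
def tpor_net_score(totals):
    """Net TPOR score from the 62 TOT values (each 0..3), item 1 first.

    Even-numbered ("positive") items count as marked.
    Odd-numbered ("negative") items are reversed: 0->3, 1->2, 2->1, 3->0.
    """
    if len(totals) != 62:
        raise ValueError("the revised TPOR has 62 items")
    score = 0
    for item_number, tot in enumerate(totals, start=1):
        if not 0 <= tot <= 3:
            raise ValueError(f"item {item_number}: TOT must be 0..3")
        score += tot if item_number % 2 == 0 else 3 - tot
    return score


# Hypothetical record: every positive item seen in all three periods,
# no negative item ever seen.
fully_experimental = [3 if n % 2 == 0 else 0 for n in range(1, 63)]
print(tpor_net_score(fully_experimental))
```

Under this rule a record with no check marks at all scores 93, because each of the 31 unmarked negative items is reversed to 3.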
TEACHER PRACTICES OBSERVATION RECORD

TOT  I   II  III   TEACHER PRACTICES

                   A. NATURE OF THE SITUATION
___ ___ ___ ___     1. T makes self center of attention.
___ ___ ___ ___     2. T makes p center of attention.
___ ___ ___ ___     3. T makes some thing itself center of p's attention.
___ ___ ___ ___     4. T makes doing something center of p's attention.
___ ___ ___ ___     5. T has p spend time waiting, watching, listening.
___ ___ ___ ___     6. T has p participate actively.
___ ___ ___ ___     7. T remains aloof or detached from p's activities.
___ ___ ___ ___     8. T joins or participates in p's activities.
___ ___ ___ ___     9. T discourages or prevents p from expressing self freely.
___ ___ ___ ___    10. T encourages p to express self freely.

                   B. NATURE OF THE PROBLEM
___ ___ ___ ___    11. T organizes learning around Q posed by T.
___ ___ ___ ___    12. T organizes learning around p's own problem or Q.
___ ___ ___ ___    13. T prevents situation which causes p doubt or perplexity.
___ ___ ___ ___    14. T involves p in uncertain or incomplete situation.
___ ___ ___ ___    15. T steers p away from "hard" Q or problem.
___ ___ ___ ___    16. T leads p to Q or problem which "stumps" him.
___ ___ ___ ___    17. T emphasizes gentle or pretty aspects of topic.
___ ___ ___ ___    18. T emphasizes distressing or ugly aspects of topic.
___ ___ ___ ___    19. T asks Q that p can answer only if he studied the lesson.
___ ___ ___ ___    20. T asks Q that is not readily answerable by study of
                       lesson.

                   C. DEVELOPMENT OF IDEAS
___ ___ ___ ___    21. T accepts only one answer as being correct.
___ ___ ___ ___    22. T asks p to suggest additional or alternative answers.
___ ___ ___ ___    23. T expects p to come up with answer T has in mind.
___ ___ ___ ___    24. T asks p to judge comparative value of answers or
                       suggestions.
___ ___ ___ ___    25. T expects p to "know" rather than to guess answer to Q.
___ ___ ___ ___    26. T encourages p to guess or hypothesize about the unknown
                       or untested.
___ ___ ___ ___    27. T accepts only answers or suggestions closely related to
                       topic.
___ ___ ___ ___    28. T entertains even "wild" or far-fetched suggestion of p.
___ ___ ___ ___    29. T lets p "get by" with opinionated or stereotyped answer.
___ ___ ___ ___    30. T asks p to support answer or opinion with evidence.
                   D. USE OF SUBJECT MATTER
___ ___ ___ ___    31. T collects and analyzes subject matter for p.
___ ___ ___ ___    32. T has p make his own collection and analysis of subject
                       matter.
___ ___ ___ ___    33. T provides p with detailed facts and information.
___ ___ ___ ___    34. T has p find detailed facts and information on his own.
___ ___ ___ ___    35. T relies heavily on textbook as source of information.
___ ___ ___ ___    36. T makes a wide range of informative material available.
___ ___ ___ ___    37. T accepts and uses inaccurate information.
___ ___ ___ ___    38. T helps p discover and correct factual errors and
                       inaccuracies.
___ ___ ___ ___    39. T permits formation of misconceptions and
                       over-generalizations.
___ ___ ___ ___    40. T questions misconceptions, faulty logic, unwarranted
                       conclusions.

                   E. EVALUATION
___ ___ ___ ___    41. T passes judgment on p's behavior or work.
___ ___ ___ ___    42. T withholds judgment on p's behavior or work.
___ ___ ___ ___    43. T stops p from going ahead with plan which T knows will
                       fail.
___ ___ ___ ___    44. T encourages p to put his ideas to a test.
___ ___ ___ ___    45. T immediately reinforces p's answer as "right" or "wrong."
___ ___ ___ ___    46. T has p decide when Q has been answered satisfactorily.
___ ___ ___ ___    47. T asks another p to give answer if one p fails to answer
                       quickly.
___ ___ ___ ___    48. T asks p to evaluate his own work.
___ ___ ___ ___    49. T provides answer to p who seems confused or puzzled.
___ ___ ___ ___    50. T gives p time to sit and think, mull things over.

                   F. DIFFERENTIATION
___ ___ ___ ___    51. T has all p working at same task at same time.
___ ___ ___ ___    52. T has different p working at different tasks.
___ ___ ___ ___    53. T holds all p responsible for certain material to be
                       learned.
___ ___ ___ ___    54. T has p work independently on what concerns p.
___ ___ ___ ___    55. T evaluates work of all p by a set standard.
___ ___ ___ ___    56. T evaluates work of different p by different standards.

                   G. MOTIVATION, CONTROL
___ ___ ___ ___    57. T motivates p with privileges, prizes, grades.
___ ___ ___ ___    58. T motivates p with intrinsic value of ideas or activity.
___ ___ ___ ___    59. T approaches subject matter in direct, business-like way.
___ ___ ___ ___    60. T approaches subject matter in indirect, informal way.
___ ___ ___ ___    61. T imposes external disciplinary control on p.
___ ___ ___ ___    62. T encourages self-discipline on part of p.

FILM STUDIES

The original 70-item form of the TPOR was used in the spring of 1964
for recording observations of five filmed teaching episodes by a large
number of observer-judges at four different teacher education institutions
in California, Illinois, New York, and Wisconsin. A year later TPOR
observations were repeated on two of these films by the same
observer-judges. These data were used to give us information about the
consistency-stability reliability of the TPOR.

The teaching episodes observed in this study were originally filmed
at Madison, Wisconsin, in the early 1960's. For the purpose of this study
30-minute continuous and uninterrupted segments were cut from unedited
films which were 50 to 60 minutes in length. Selection of the films and
the segments taken from them was made for purposes of achieving variety
in teaching style, and in grade level and subject taught. Teachers in
the films were equally well trained (all had master's degrees) and had
been selected for filming at the University of Wisconsin as "showcase"
teachers. Film #1 was of a ninth-grade French class; Film #2, a
seventh-grade mathematics class; Film #3, a fourth-grade unit on "Weather";
Film #4, a ninth-grade speech class; and Film #5, a seventh-grade science
class.

The observer-judges were drawn from the faculties of two large
midwestern universities and two large state "teachers college-type"
schools, one in the east and one in the far west. The observer-judges
included student teaching supervisors, education professors, and professors
of academic subjects who volunteered their participation in the project.
None of them had seen the films or the TPOR prior to the viewing sessions,
held separately at the four different campuses over a span of six weeks.
Conditions of the viewing sessions were similar. All observer-judges
received the same 10-minute explanation, by the same person, for recording
their observations in the TPOR. During the viewing of Film #1 time was
called periodically for the observer-judges and lights were switched on
and off to make it easier for them to become familiar with the observational
procedures and instrumentation. This constituted the sum total of "training"
provided the observers. No attempt was made to bring them to any sort of
agreement with respect to their recorded observations, nor was any discussion
to this effect permitted. Assistance with respect to time and lighting was
discontinued after the first film observation, putting the observers "on
their own" in every respect.

Table I shows the mean TPOR score given each of the five films by the
observer-judges on the first viewing. The French teacher in Film No. 1 was
seen as the least experimental and the fourth-grade teacher in Film No. 3
as the most in agreement with Dewey. The range of more than 40 points
between the high and low TPOR means indicates the ability of the instrument
to differentiate various styles of teaching.

TABLE I

Mean TPOR Scores Given Five Films
by All Observers

Film     No. of Observers     Mean      S.D.
No. 1         130             80.01     13.32
No. 2         124            113.86     16.84
No. 3         119            120.96     22.74
No. 4         119            104.24     17.10
No. 5          67             98.84     12.88

We looked for differences in the TPOR scores given at the four different
participating institutions. The location variable was found to
have little or no influence. Using Scheffé's comparisons, no statistically
significant differences were found among the TPOR means given at the
various locations for Film Nos. 1, 2, 4, and 5. The only statistically
significant differences were found between California and each of the other
three locations on Film No. 3.

We also looked for differences in the TPOR scores given by the three
major occupational classifications of observer-judges: college supervisors
of student teaching, education professors, and academic professors. No
statistically significant differences were found between any of these
groups for Film Nos. 1, 2, 4, and 5. The only statistically significant
differences were found between supervisors of student teaching and both
education and academic professors on Film No. 3.

TPOR means were also examined in relation to the evaluative judgments
made about the quality of teaching observed in the films. Table II shows
an interesting pattern of correlation between TPOR scores and ratings given
each film. While this could mean that the TPOR scores were influenced by
how much the observer liked what he saw, the converse is more likely true.
The wide differences in TPOR means within each of the evaluative categories
are evidence that the correlation between TPOR scores and ratings is relative
within the limits describing each individual film. In this study, a given
TPOR score did not guarantee a "good" or "bad" rating, even though in
every case the higher the rating, the higher the TPOR mean score.

TABLE II

The Relationship Between TPOR Means and Evaluative
Ratings of Five Filmed Teaching Episodes

                              Evaluative Ratings
         A              B              C              D              E              F
Film     Outstanding    Very Good      Good           Fair           Poor           Incompetent
No. 1     88.64 (11)     82.21 (48)     79.45 (33)     67.89 (9)      -- (0)         -- (0)
No. 2    126.47 (19)    118.57 (56)    109.19 (21)     -- (0)         [illegible]    -- (0)
No. 3    138.19 (27)    119.32 (38)    109.91 (23)     85.00 (1)      -- (0)         -- (0)
No. 4    110.69 (29)    106.86 (43)     93.73 (11)     76.50 (2)      75.50 (2)      -- (0)
No. 5    115.56 ([illegible])  103.39 (13)   96.00 ([illegible])   88.67 (6)   65.00 (1)   -- (0)

Statistically significant differences beyond the .05 level (using
Scheffé's comparison procedures) were found for the following pairs of
means:

Evaluative Category A: Films 1 and 2, 1 and 3 and 4 (1 and 4,
1 and 5 were very close)

Evaluative Category B: Films 1 and 2, 1 and 3 and 4

Evaluative Category C: Films 1 and 2, 1 and 3 (1 and 5 were close)

Film No. 3: Categories A and B, A and C

Mass observations of films are expensive and administratively difficult
to arrange. For these reasons repeated observations the second year
could be obtained on only two of the five films. Film No. 1 was eliminated
because it had been used as the "training" film and the conditions of the
first viewing could not be simulated. Film No. 3 yielded a wide discrepancy
between the scores given it in California and those given it at the other
three locations, which we thought might be due to the artificial conditions
under which it was filmed. Film No. 5 had not been observed at all four
institutions. This left No. 2 and No. 4, which were selected by elimination
for the second viewing. It was possible to obtain repeated TPOR scores
on these two films by only a portion of those who observed the first
viewings.

Table III shows a fairly substantial difference between TPOR means
recorded for the first and second viewings of Film No. 2. While this
difference raises some questions about stability, both means for this
subgroup of 69 observers lie well within one standard deviation of the mean
of 115.86 for 119 first-viewing observers, which may simply demonstrate
the normal variability of TPOR scores. The differences between TPOR
scores for the first and second viewings of Film No. 4 are very small.

TABLE III

Mean TPOR Scores Given Films On
Repeated Observations One Year Apart

Film     Viewing    No. of Observers     Mean      S.D.
No. 2    1st              69            122.22     20.52
No. 2    2nd              69            109.81     18.31
No. 4    1st              72            107.15     17.15
No. 4    2nd              72            105.14     18.12

RELIABILITY

In order for anyone to place confidence in the scores obtained with 
the TPOR, its reliability as a measuring instrument must be established. 
There are three major problems involved in doing this: 

1. Selecting types (or definitions) of reliability appropriate 
to the instrument and the purposes for which it is designed. 

2. Selecting a meaningful measure (or yardstick) of reliability 
once the type is specified. 

3. Selecting a good estimator of a given measure to give an 
estimate of reliability based on experimental data. 

Reliability can be a tricky concept. We know that reliability always
refers to consistency throughout a series of measurements, and that it is
usually expressed in terms of something called reliability coefficients.
Rarely do we make clear what kind of consistency has been figured. Although
everybody in educational research reads reliability coefficients, few seem
to really understand (or care) what these mean or how they were obtained.
All that matters is that they be high. Once the standard for "highness"
has been debated and denoted, then surpassed or fallen short of, what more
is there to say about reliability?

There are many different kinds of reliability to be considered.
Thorndike speaks of approaching the study of reliability from two quite
different viewpoints. One approach is to be concerned about the actual
or absolute magnitude of errors of measurement. In this case reliability
is expressed in terms of the variability of scores obtained by repeated
testing of the same individual, and is based on a statistic called the
standard error of measurement. Another approach can be made in terms of
the consistency with which individuals maintain the same relative position
in the total group on repetition of a measurement procedure. In this case
consistency is expressed in terms of the correlation between two sets of
scores, called the coefficient of reliability.² As a further example,
Cronbach points out that not all reliability coefficients reveal the same
or even comparable information. He refers to "comparable-forms,"
"split-half," and "test-retest" reliability coefficients as ways to get at
different aspects of reliability. The first is a "coefficient of
equivalence and stability," the second a "coefficient of equivalence"
only, and the third a "coefficient of stability."³ Furthermore, we have
something called internal consistency or item reliability, which assesses
test homogeneity, or the extent to which all items measure the same
attribute. This, of course, is a horse of still another color. All of
which makes the use of the term "reliability" meaningless without some
sort of further differentiation and definition.

² Robert L. Thorndike, "Reliability," Chapter 15 in Educational
Measurement, E. F. Lindquist, Editor (Washington, D. C.: American
Council on Education, 1950), pp. 560-61.

³ Lee J. Cronbach, Essentials of Psychological Testing, Second Edition
(New York: Harper & Brothers, 1960), pp. 136-142.

Dealing adequately with already difficult concepts of reliability
becomes even more complex when one turns from consideration of tests of
achievement and intelligence, and the like, to the measurement of classroom
behavior by systematic observation. The question of the reliability
of the observers and the recording of their observations must be added
to the problem. In the past most observational studies have limited their
study of reliability to computing the correlation between two sets of
observations or to figuring the percent of agreement between observers.

Keeping this tradition, in part, we computed the correlations between
the TPOR scores obtained from the repeated observations of Films No. 2 and
No. 4. It is curious to note in Table IV that the correlations of the
columns (10-minute observation periods) within each film observation are
very high, but the correlations between the 1964 and 1965 observations
are very low. The first indicates that the observers tended to maintain
the same relative position in the group throughout the viewing of a single
film on a given day. The second indicates that sizeable shifts in these
positions took place during the intervening year. In other words we got
good consistency within one occasion or viewing, and again within another,
but poor stability between two widely separated occasions. One must keep
in mind, however, that such reliability coefficients normally decline
proportionately with the length of time between "tests." Had the repeat
observations been made only a month or so apart we might expect
considerably higher correlations.

TABLE IV

Correlation of TPOR Scores Obtained from
Repeated Observations of Films

FILM NO. 2

                        1964 Observation              1965 Observation
                          TPOR Column                   TPOR Column
TPOR Column            1      2      3     TOT       1      2      3     TOT
1964 Observation
    1                1.00    .79    .69    .89      .36    .25    .12    .27
    2                 --    1.00    .81    .95      --     .29    .16    .31
    3                 --     --    1.00    .92      --     --     .20    .29
    TOT               --     --     --    1.00      --     --     --     .32
1965 Observation
    1                 --     --     --     --      1.00    .61    .55    .80
    2                 --     --     --     --       --    1.00    .81    .93
    3                 --     --     --     --       --     --    1.00    .90
    TOT               --     --     --     --       --     --     --    1.00

FILM NO. 4

                        1964 Observation              1965 Observation
                          TPOR Column                   TPOR Column
TPOR Column            1      2      3     TOT       1      2      3     TOT
1964 Observation
    1                1.00    .75    .52  [illegible] .32    .36    .25    .34
    2                 --    1.00    .71    .93      --     .46    .52    .52
    3                 --     --    1.00    .85      --     --     .67    .57
    TOT               --     --     --    1.00      --     --     --     --
1965 Observation
    1                 --     --     --     --      1.00    .79    .71    .90
    2                 --     --     --     --       --    1.00    .83    .95
    3                 --     --     --     --       --     --    1.00    .92
    TOT               --     --     --     --       --     --     --    1.00

Even so, correlation of two sets of scores by a number of different
observers is not likely to be a very accurate estimate of reliability.
It is difficult to make arrangements for large numbers of observers to
view the same classroom on two different occasions, or to control variations
between those occasions. Likewise, the number of classrooms observed on
two different occasions by two different observers is likely to be small.
In either case, the size of the N determines the precision of the
correlation coefficient, and since the N of even well-financed observational
studies rarely exceeds 100, the confidence intervals for the coefficients
are extremely wide. Furthermore, such correlations are usually based on
total scores which ignore variations in scoring individual items or
categories. It is possible to obtain a perfect correlation of total
scores when the reliability for the items is zero. If on a 70-item
"sign" system, for example, the 35 odd-numbered items are marked "1"
and the 35 even-numbered items are marked "0" on the first observation,
and then exactly reversed on the second observation, identical total
scores will be obtained and used to produce a deceivingly perfect
reliability correlation.
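The 70-item example can be made concrete with a short Python sketch. It assumes each item is simply marked 1 or 0 and the total is the count of marks; the variable names are illustrative only.

```python
# First observation: odd-numbered items marked 1, even-numbered items marked 0.
n_items = 70
first = [1 if item % 2 == 1 else 0 for item in range(1, n_items + 1)]

# Second observation: every mark exactly reversed.
second = [1 - mark for mark in first]

total_first = sum(first)
total_second = sum(second)

# Share of items on which the two observations give the same mark.
item_agreement = sum(a == b for a, b in zip(first, second)) / n_items

print(total_first, total_second, item_agreement)
```

The two totals are identical even though not a single item was marked the same way twice, which is why a correlation built on totals alone can hide complete disagreement at the item level.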
Percent of agreement between observers tells almost nothing about
the accuracy of the scores obtained. It is entirely possible to find
observers agreeing 99 percent in recording behaviors on an instrument
whose item or category consistency is very poor. Reliability can be low
even though observer agreement is high for several reasons. For example,
observers might be able to agree perfectly that a particular teaching
practice occurred in a classroom, yet if that same practice occurs equally,
or nearly so, in all classrooms, the reliability of that item as a measure
of differences between teachers will be zero. Near-perfect agreement
could also be reached about the percentage of time a number of teachers
employed certain categories of behavior; but if every teacher sharply
reversed these percentages from period to period or day to day, the
reliability of these categories would be zero. Errors arising from
variations in behavior from one situation or occasion to another can far
outweigh errors arising from failure of two observers to agree exactly
in their records of the same behavior.
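A small Python sketch of the first case, using made-up marks for five classrooms, shows how perfect observer agreement can coexist with an item that carries no information about differences between teachers. The data are hypothetical and chosen only to illustrate the point.

```python
from statistics import pvariance

# Hypothetical marks for one item: one (observer A, observer B) pair per
# classroom. Both observers saw the practice in every classroom.
marks = [(1, 1), (1, 1), (1, 1), (1, 1), (1, 1)]

# Percent of agreement between the two observers.
agreement = sum(a == b for a, b in marks) / len(marks)

# Variation of the item score across classrooms (teachers).
classroom_scores = [a + b for a, b in marks]
between_teacher_variance = pvariance(classroom_scores)

print(agreement, between_teacher_variance)
```

With no variance between teachers there is nothing for the item to discriminate, so its reliability as a measure of teacher differences is zero despite 100 percent agreement.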

Yet the reliability of most instruments for systematically recording
the behavior of teachers, including Flanders' well-known Classroom
Interaction Analysis, requires a high percent of observer agreement.
"Between-observer" agreement has become almost a cardinal principle in
planning observational studies. According to Medley and Mitzel a sample of
classrooms from the population to be studied should be visited by trained
recorders using the observational instrument in the same way it will be
used in any subsequent study. In order to study the "objectivity" of the
items, i.e., how closely observers agree in recording identical behaviors,
at least two recorders should be present on each visit, sitting in
different parts of the room and making independent records. In order to
be able to estimate how stable records based on different visits
will be, each class should be visited at least twice. To recapitulate,
in their words, "c teachers are visited in s situations by a team of r
recorders; in studying the reliability of a scale with i items on it,
the total number of scores to be analyzed will be cirs."⁴

⁴ Donald M. Medley and Harold E. Mitzel, "Measuring Classroom Behavior
by Systematic Observation," Chapter 6 in Handbook of Research on Teaching,
N. L. Gage, Editor (Chicago: Rand McNally & Company, 1963), p. 309.

To match this rigorous plan for data collection Medley and Mitzel
have taken the classic definition of reliability, $\rho_{xx} = \sigma_t^2 / \sigma_x^2$,
and applied it to measurements of classroom behavior. In this definition,
true variation, $\sigma_t^2$, is defined to be the variation of the total score
for any class (teacher) when the effects of recorders (observers), items
on the scoring instrument, and situations (viewings or visits) have been
removed. The true variation plus "errors," $\sigma_x^2$, is defined to be the
variation of the total scores for any class, including variation
contributed by items on the scoring instrument, recorders, situations and
random error. The smaller the effect of the recorders, items, and situations
for a class total, the higher the reliability coefficient will be. In
other words, if the instrument has high reliability the scoring of the
class or teacher is relatively free of the effects of recorders, items,
or the different situations under which the scoring was done, and as such,
reflects a "good" or reliable instrument.⁵

⁵ Ibid.
In seeking a design for estimating the reliability of TPOR observations,
we closely examined the four-way analysis of variance model suggested by
Medley and Mitzel. While we found it to be a sound approach to reliability
estimation, it may not be entirely appropriate for analyzing the data
obtained in the film study described above. For instance, in the simple
example given by Medley and Mitzel in the Handbook of Research on Teaching,
page 316, where one item is used to score 24 classes (teachers) observed
during four situations by two recorders (observers), the reliability
coefficient is estimated by:

$$\hat{\rho}_{xx} = 1 - \frac{MS_{c \times r}}{MS_c}$$

where $MS_{c \times r}$ is the mean square for classes x recorders obtained from
the analysis of variance table and $MS_c$ is the mean square for classes
obtained from the analysis of variance table. The coefficient of
reliability in this case actually reflects not instrument reliability, but
rather, recorder or observer reliability. When $MS_{c \times r}$ is large, it
indicates an inconsistency on the part of the observers to score the
classes in the same way, which in turn causes $\hat{\rho}_{xx}$ to be small. In like
manner, a very small value of $MS_{c \times r}$ reflects consistency in scoring, in
which case $\hat{\rho}_{xx}$ will be large.
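How the two mean squares drive the estimate can be seen in a minimal Python sketch. It assumes a simplified layout with one score per class and recorder (for instance, a score already summed over situations), so the classes x recorders mean square is the residual of a two-way table. The function name and the sample data are hypothetical, and this is not the full four-way model.

```python
def recorder_reliability(scores):
    """Estimate 1 - MS_cxr / MS_c from scores[c][r] (class c, recorder r)."""
    n_c, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_c * n_r)
    class_means = [sum(row) / n_r for row in scores]
    recorder_means = [sum(row[r] for row in scores) / n_c for r in range(n_r)]

    # Sum of squares for classes, and for the classes x recorders interaction.
    ss_c = n_r * sum((m - grand) ** 2 for m in class_means)
    ss_cxr = sum(
        (scores[c][r] - class_means[c] - recorder_means[r] + grand) ** 2
        for c in range(n_c)
        for r in range(n_r)
    )

    ms_c = ss_c / (n_c - 1)
    ms_cxr = ss_cxr / ((n_c - 1) * (n_r - 1))
    return 1 - ms_cxr / ms_c


# Hypothetical scores: three classes, two recorders.
consistent = [[1, 2], [3, 4], [5, 6]]    # recorders differ only by a constant
less_consistent = [[1, 2], [2, 1], [5, 6]]
print(recorder_reliability(consistent), recorder_reliability(less_consistent))
```

When the recorders rank and space the classes identically, the interaction mean square is zero and the estimate is 1; any disagreement about individual classes raises the interaction term and lowers the estimate.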

Training of the observers undoubtedly would bring them into agreement
with respect to recording or scoring identical behaviors, which would be
reflected in a higher reliability coefficient, $\hat{\rho}_{xx}$. However, in the
previously described film study in which the TPOR was tried out, no
attempt was made to train the observers. To the contrary, we deliberately
tried to preserve the differences among observers by selecting them from
varying occupational groups, from varying sizes of institutions with
varying orientations to teacher education, and from varying parts of the
country. We wanted to test the reliability of the TPOR under uncontrolled
field conditions to see what value it might have in the hands of the
differing kinds of people who carry out the everyday responsibilities for
teacher education in America. Hence, the component of variance due to
the observers' variability in our study would cause $MS_{c \times r}$ to be large
compared to $MS_c$, resulting in a small $\hat{\rho}_{xx}$. We did not get as much
observer variability as might have been expected, however. When the
Medley-Mitzel model was adapted to fit our film study data the TPOR
observations were found to have a modest but substantial reliability
coefficient of .57.

In the analysis of variance example cited above it should also be
noted that two of the variables of interest, viz., classes and situations,
had but one degree of freedom each. This being the case, "poor" estimates
of the components of variance could result. In fact, the components of
variance could be estimated to be zero (which happens in many cases).
Also, since the estimate of $\rho$ would consist of the ratio of linear
combinations of mean squares, the bounds of error on this estimate could
be exceedingly large.

The unsuitability of the Medley-Mitzel model for our data results
primarily, however, from the fact that it stresses "between-observer"
variability rather than "within-observer" variability. This is a
philosophical rather than a statistical issue. Reliability coefficients
which reward high agreement between observers imply that we should
seek a single, uniform, "objective" system for observing and classifying
teaching behavior. From the point of view of the framework underlying the
development of the TPOR, objectivity in perceiving and quantifying such
behavior is neither possible nor desirable. "Between-observer" agreement
may not only encourage a false sense of confidence with respect to the
accuracy of measurements, but also give us a false sense of "objectivity"
regarding the observations. A team of observers can be brainwashed to the
point of near-perfect agreement, but this does not erase the possibility
that instead of several differing "subjective" judgments, they now make
only one. Therefore, we sought another mathematical definition of
reliability, one which is concerned primarily with "within-observer"
variability.

We reasoned that if, having scored a given filmed teaching situation, 
the same observer-judge were to score the same teaching situation again 
in the same way, then we could say the observer-judge's scoring was 
reliable. Hence, a definition of "within-observer" reliability for 
a given observer-judge and film was devised as follows: 

Items    Viewing 1    Viewing 2    d_i = x_{1i} - x_{2i}

  1       x_{11}       x_{21}            d_1
  2       x_{12}       x_{22}            d_2
  3       x_{13}       x_{23}            d_3
  .          .            .               .
  .          .            .               .
  n       x_{1n}       x_{2n}            d_n

Consider the variance of the differences $d_i$, where 

$$d_i = x_{1i} - x_{2i}.$$

If the scores are independent, i.e., the judge is not consistent, or in 
fact marks by chance, then 

$$
\begin{aligned}
V(d_i) &= V(x_{1i} - x_{2i}) \\
       &= V(x_{1i}) + V(x_{2i}) \\
       &= \sigma^2 + \sigma^2 \\
       &= 2\sigma^2 \quad (\text{or } 2\,\mathrm{Var}(x)).
\end{aligned}
$$

However, if the judge is consistent from viewing to viewing, his two scores 
should be positively correlated and now 

$$
\begin{aligned}
V(d_i) &= V(x_{1i} - x_{2i}) \\
       &= V(x_{1i}) + V(x_{2i}) - 2\,\mathrm{Cov}(x_{1i}, x_{2i}) \\
       &= 2\sigma^2 - 2\sigma_{12},
\end{aligned}
$$

where $\sigma_{12}$ denotes $\mathrm{Cov}(x_{1i}, x_{2i})$, or 

$$V(d_i) = \sigma_d^2 = 2\sigma^2 - 2\sigma_{12}.$$

It is noted that the following assumptions are made in the above discussion: 

1) The variance of each item score is the same for all items over 
viewings; i.e., 

$$V(x_{ij}) = \sigma^2 \quad \text{for } i = 1, 2; \; j = 1, \ldots, n.$$

2) Under the complete randomness assumed under chance scoring, each 
value of X is assumed to have an equal chance of being selected; 
hence 

$$P(X) = \frac{1}{k},$$

where k is the number of choices available. 
Now we define, for judge J and film f, 

$$r_{Jf} = 1 - \frac{\sigma_d^2}{2\sigma^2},$$

where 

$$\sigma_d^2 = \mathrm{Var}(d_i), \quad i = 1, \ldots, n,$$

$$\sigma^2 = \mathrm{Var}(x_{ij}), \quad i = 1, 2; \; j = 1, \ldots, n.$$

However, under the assumption of a random choice by the judge, $\sigma^2$ 
becomes a constant, computed as 

$$\sigma^2 = \sum_{x} (x - \mu)^2 P(x).$$

We calculate the sample value $s_d^2$ and use it to estimate $\sigma_d^2$. 
Hence we are working with a statistic 

$$r_{Jf} = 1 - \frac{s_d^2}{2\sigma^2}.$$
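As a rough illustration of the statistic just defined, the following Python sketch computes a within-observer coefficient from one judge's item scores on two viewings of the same film. It assumes that the coefficient has the form r = 1 - s_d^2 / (2 sigma^2), that the scale values passed in as `choices` are equally likely under chance scoring, and that s_d^2 is the ordinary sample variance of the item differences. The function names and the five-point scale are hypothetical and are not taken from the TPOR itself.

```python
from statistics import variance


def chance_variance(choices):
    """Variance of one score when every scale value is equally likely."""
    k = len(choices)
    mu = sum(choices) / k                      # mean under P(x) = 1/k
    return sum((x - mu) ** 2 for x in choices) / k


def within_observer_reliability(viewing1, viewing2, choices):
    """Sketch of r = 1 - s_d^2 / (2 * sigma^2) for one judge and one film."""
    if len(viewing1) != len(viewing2):
        raise ValueError("both viewings must score the same items")
    diffs = [a - b for a, b in zip(viewing1, viewing2)]
    s_d2 = variance(diffs)                     # sample variance (n - 1 denominator)
    return 1 - s_d2 / (2 * chance_variance(choices))


# Hypothetical five-point scale and item scores, for illustration only.
scale = [1, 2, 3, 4, 5]
first = [1, 3, 5, 2, 4, 3]
second = [1, 3, 5, 2, 4, 3]                    # identical rescoring
r_same = within_observer_reliability(first, second, scale)

# Scores reversed on the second viewing: the coefficient goes negative.
r_reversed = within_observer_reliability([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], scale)
```

Under these assumptions, identical rescoring gives a coefficient of 1, while the reversed scoring gives a negative value. Because only the variance of the differences enters the formula, a constant shift in every item score between viewings would also leave the coefficient at 1.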



Now, if there is in fact high positive correlation of the scoring from 
viewing 1 to viewing 2, then $s_d^2$ will be small (i.e., the sample 
covariance $s_{12}$ of the two viewings will be large) and $r_{Jf}$ will be 
close to 1. 

If the scoring from viewing to viewing is in fact independent and 
really associated with a chance event, then $s_d^2$ will be of the 
magnitude of $2\sigma^2$ (i.e., $s_{12}$ will be small; close to zero). 

The coefficient $r_{Jf}$ will theoretically be in the interval (0,1), 
where a maximum value of one implies absolute correlation, while a minimum 
value of zero implies the same scoring could have happened by chance, 
hence no reliability. However, the possibility of $r_{Jf} < 0$ exists because 
there is a non-zero probability that the scorings will be negatively 
correlated, and this may cause $s_d^2$ to be greater than $2\sigma^2$, which 
in turn causes $r_{Jf} < 0$. 
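
A small worked example may help fix the range of the coefficient. It assumes the coefficient takes the form $r_{Jf} = 1 - s_d^2/(2\sigma^2)$ and uses a hypothetical five-point scale scored 1 through 5 with each value equally likely under chance scoring; the scale and the three values of $s_d^2$ are illustrative assumptions, not TPOR data.

```latex
\begin{aligned}
\mu &= \tfrac{1}{5}(1 + 2 + 3 + 4 + 5) = 3, \\
\sigma^2 &= \tfrac{1}{5}\left[(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2\right] = 2,
  \qquad 2\sigma^2 = 4, \\
s_d^2 = 0 &\;\Rightarrow\; r_{Jf} = 1 - 0/4 = 1, \\
s_d^2 = 4 &\;\Rightarrow\; r_{Jf} = 1 - 4/4 = 0, \\
s_d^2 = 6 &\;\Rightarrow\; r_{Jf} = 1 - 6/4 = -0.5 .
\end{aligned}
```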

Worth mentioning is the fact that this statistic uses a larger than 
expected variance, $\sigma^2$, as a yardstick against which the judge's variation 
from viewing to viewing is compared. This is because one would expect 
a judge to select the extremes in scoring an item less frequently than 
scores near the center of the scale; such scoring would likely yield a 
variance smaller than that implied by a completely random selection. 

This yardstick could, in effect, cause the coefficient $r_{Jf}$ to be depressed 
as compared with other measures of reliability. 
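
The variance comparison in this paragraph can be illustrated numerically. The sketch below is only an assumption-laden example: the five-point scale and the center-heavy probabilities are invented for illustration and do not come from the TPOR data. It compares the variance under completely random selection with the variance for a judge who favors the middle of the scale.

```python
def scale_variance(probabilities, choices):
    """Variance of a score drawn from `choices` with the given probabilities."""
    mu = sum(p * x for p, x in zip(probabilities, choices))
    return sum(p * (x - mu) ** 2 for p, x in zip(probabilities, choices))


choices = [1, 2, 3, 4, 5]                  # hypothetical five-point scale
uniform = [0.2, 0.2, 0.2, 0.2, 0.2]        # chance scoring, P(x) = 1/k
center_heavy = [0.1, 0.2, 0.4, 0.2, 0.1]   # illustrative judge who avoids extremes

var_uniform = scale_variance(uniform, choices)
var_center = scale_variance(center_heavy, choices)
```

With these made-up probabilities the center-heavy variance is about 1.2 against about 2.0 for uniform selection, which is the sense in which the chance-scoring yardstick is larger than the variance a real judge would be expected to show.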

Using the above formulation, the "within-observer" reliability of 
TPOR scores was computed for the two filmed teaching situations on which 
repeated viewings were made a year apart. Table V shows eight reliability 
coefficients ranging between .[?]8 and .62. 

TABLE V 

"Within-Observer" Reliability Coefficients for 
TPOR Scores on Repeated Viewings of Films 

FILM NO. 2 (N = 69) 

TPOR Column      r        error 
TOT              .[?]8    .0255 
1                .57      .0177 
2                .51      .0194 
3                .51      .0177 

FILM NO. 4 (N = 72) 

TPOR Column      r        error 
TOT              .52      .0191 
1                .56      .0182 
2                .57      .0244 
3                .62      .0171 


These coefficients of reliability, as is the case with those obtained 
using the Medley-Mitzel model, reflect observer reliability rather than 
instrument reliability. Observer reliability is always subject to variations 
in the selection and training of people and the control of conditions 
under which they use an instrument. People and conditions can be "improved" 
in subsequent studies, but once they are "out," instruments rarely are. So 
it is important to know about the internal consistency of the instrument, 
its item reliability, which tells us something of its potential in the 
hands of reliable observers. 

Table VI shows the results of submitting the film study data to the 
Kuder-Richardson formulation for measuring item reliability. If each item 
is highly correlated with every other item on the instrument, then the 
instrument has good item reliability or internal consistency. The fact 
that the TPOR scores yielded uniformly high internal reliability coefficients 
is not surprising in light of the fact that throughout their development 
the TPOR, TPI, and PBI underwent repeated RAVE analysis, an iterative 
procedure which yields a set of item response weights which maximize the 
internal consistency of inventories. 
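
The source does not say which Kuder-Richardson formula was applied or how the weighted TPOR items entered it, so the following Python sketch is offered only as an illustration of the general idea. It uses the familiar KR-20 form for items scored 0 or 1, r = (k / (k - 1)) (1 - sum of p_j q_j / variance of total scores); the function name and the tiny data set are hypothetical.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a list of respondents, each a list of 0/1 item scores."""
    k = len(responses[0])                      # number of items
    n = len(responses)                         # number of respondents
    totals = [sum(person) for person in responses]
    total_var = pvariance(totals)              # population variance of total scores
    # p is the proportion scoring 1 on an item; p * (1 - p) is that item's variance.
    sum_pq = 0.0
    for j in range(k):
        p = sum(person[j] for person in responses) / n
        sum_pq += p * (1 - p)
    # Assumes the total scores are not all identical (total_var > 0).
    return (k / (k - 1)) * (1 - sum_pq / total_var)


# Hypothetical data: four respondents, three dichotomous items.
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
coefficient = kr20(data)
```

For this invented data set the total-score variance is 1.25 and the item variances sum to 0.625, so the coefficient works out to 0.75. Items that rise and fall together push the total-score variance up relative to the item variances, which is what drives the coefficient toward 1.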
TABLE VI 

TPOR Internal Consistency Reliability Coefficients 
In summary, we wish to emphasize that the TPOR was developed for 
wide-scale field use by "untrained" observers in the study of teaching 
behavior in relation to philosophic and educational beliefs. Instead of 
trying to "train out" the pluralistic biases in the perceptions of our 
observer-judges, we deliberately left them alone, took them just as they 
came, and tried to include them and take them into account as we analyzed 
the results obtained. This analysis, of course, awaits reporting elsewhere. 
In this paper we are concerned only with reporting, in the context 
of a discussion of problems involved in defining and measuring reliability, 
the reliability data obtained from experimental use of the TPOR. Having 
submitted this instrument to the hazards of uncontrolled use by uncontrolled 
observers, and then to the severest statistical procedures we could find, 
we found that it came out with the following score card: (1) Correlation 
of observers' total scores within a given film viewing: VERY GOOD, (2) 
Correlation of observers' total scores between repeat film viewings one 
year apart: POOR to FAIR, (3) Between-observer reliability: FAIR, (4) 
Within-observer reliability: FAIR, (5) Internal consistency reliability: 
VERY GOOD. 



